Abstract: A classical condition for fast learning rates is the margin
condition, first introduced by Mammen and Tsybakov. We tackle in this paper
the problem of adaptivity to this condition in the context of model
selection, in a general learning framework. Actually, we consider a weaker
version of this condition that allows one to take into account that learning
within a small model can be much easier than within a large one. Requiring
this "strong margin adaptivity" makes the model selection problem more
challenging. We first prove, in a general framework, that some penalization
procedures (including local Rademacher complexities) exhibit this
adaptivity when the models are nested. Contrary to previous results, this holds
with penalties that only depend on the data. Our second main result is that
strong margin adaptivity is not always possible when the models are not
nested: for every model selection procedure (even a randomized one), there
is a problem for which it does not demonstrate strong margin adaptivity.

1. Introduction

We consider in this paper the model selection problem in a general framework.
Since our main motivation comes from the supervised binary classification
setting, we focus on this framework in this introduction. Section 2 introduces
the natural generalization to empirical (risk) minimization problems, which we
consider in the remainder of the paper.

We observe independent realizations $(X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \ldots, n$ of a
random variable with distribution $P$, where $\mathcal{Y} = \{0,1\}$. The goal is to build
a (data-dependent) predictor $t$ (i.e., a measurable function $\mathcal{X} \to \mathcal{Y}$) such that
$t(X)$ is as often as possible equal to $Y$, where $(X, Y) \sim P$ is independent from
the data. This is the prediction problem, in the setting of supervised binary
classification. In other words, the goal is to find $t$ minimizing the prediction
error $P\gamma(t;\cdot) := \mathbb{P}_{(X,Y)\sim P}(t(X) \neq Y)$, where $\gamma$ is the 0-1 loss.

The minimizer $s$ of the prediction error, when it exists, is called the Bayes
predictor. Define the regression function $\eta(X) = \mathbb{P}_{(X,Y)\sim P}(Y = 1 \mid X)$. Then,
a classical argument shows that $s(X) = \mathbb{1}_{\eta(X) > 1/2}$. However, $s$ is unknown, since
it depends on the unknown distribution $P$. Our goal is to build from the data
some predictor $t$ minimizing the prediction error, or equivalently the excess loss
$\ell(s,t) := P\gamma(t) - P\gamma(s)$.

A classical approach to the prediction problem is empirical risk minimization.
Let $P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i,Y_i)}$ be the empirical measure and $S_m$ be any set of
predictors, which is called a model. The empirical risk minimizer over $S_m$ is
then defined as

$$\widehat{s}_m \in \arg\min_{t \in S_m} P_n\gamma(t) = \arg\min_{t \in S_m} \left\{ \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{t(X_i) \neq Y_i} \right\}.$$

We expect that the risk of $\widehat{s}_m$ is close to that of

$$s_m \in \arg\min_{t \in S_m} P\gamma(t),$$

assuming that such a minimizer exists.
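
As an illustration of the definition above, here is a minimal sketch of empirical risk minimization with the 0-1 loss over a finite model. The threshold predictors and the toy sample are hypothetical choices made for this example only; they are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Predictor = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]


def empirical_risk(t: Predictor, sample: Sample) -> float:
    """Empirical 0-1 risk P_n gamma(t): fraction of points with t(X_i) != Y_i."""
    return sum(1 for x, y in sample if t(x) != y) / len(sample)


def erm(model: Sequence[Predictor], sample: Sample) -> int:
    """Index of an empirical risk minimizer over a finite model S_m."""
    risks = [empirical_risk(t, sample) for t in model]
    return min(range(len(model)), key=risks.__getitem__)


def threshold(c: float) -> Predictor:
    """Hypothetical predictor t_c(x) = 1 if x > c, else 0."""
    return lambda x: 1 if x > c else 0


# Toy sample (assumption): six labelled points on the real line.
sample = [(0.1, 0), (0.2, 0), (0.4, 1), (0.6, 1), (0.7, 1), (0.9, 1)]
# Finite model S_m made of four threshold predictors.
model = [threshold(c) for c in (0.0, 0.3, 0.5, 0.8)]
best = erm(model, sample)  # index of the empirical risk minimizer
```

On this sample the threshold 0.3 classifies every point correctly, so `best` is its index in `model`. Ties, if any, are broken in favor of the first minimizer, which is one arbitrary way of picking an element of the arg min.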



1.1. Margin condition 

Depending on some properties of $P$ and the complexity of $S_m$, the prediction
error of $\widehat{s}_m$ is more or less distant from that of $s_m$. For instance, when $S_m$ has
a finite Vapnik-Chervonenkis dimension $V_m$ [27, 26] and $s \in S_m$, it has been
proven (see for instance [TH]) that

$$\mathbb{E}[\ell(s, \widehat{s}_m)] \leq C \sqrt{\frac{V_m}{n}}$$

for some numerical constant $C > 0$. This is optimal without any assumption
on $P$, in the minimax sense: no estimator can have a smaller prediction risk
uniformly over all distributions $P$ such that $s \in S_m$, up to the numerical factor
$C$.

However, there exist favorable situations where much smaller prediction
errors ("fast rates", up to $n^{-1}$ instead of $n^{-1/2}$) can be obtained. A sufficient
condition, the so-called "margin condition", has been introduced by Mammen
and Tsybakov [21]. If, for some $\varepsilon_0, C_0 > 0$ and $\alpha > 1$,

$$\forall \varepsilon \in (0, \varepsilon_0], \quad \mathbb{P}\left( |2\eta(X) - 1| \leq \varepsilon \right) \leq C_0 \varepsilon^{\alpha}, \qquad (1)$$

if the Bayes predictor $s$ belongs to $S_m$, and if $S_m$ is a VC-class of dimension $V_m$,
then the prediction error of $\widehat{s}_m$ is smaller than $L(C_0, \varepsilon_0, \alpha) \ln(n) (V_m/n)^{\kappa/(2\kappa-1)}$ in
expectation, where $\kappa = (1 + \alpha)/\alpha$ and $L(C_0, \varepsilon_0, \alpha) > 0$ only depends on $C_0$,
$\varepsilon_0$ and $\alpha$. Minimax lower bounds [23] and other upper bounds can be obtained
under other complexity assumptions (for instance assumption (A2) of Tsybakov
[24], involving bracketing entropy). In the extreme situation where $\alpha = +\infty$,
i.e., for some $h > 0$,

$$\mathbb{P}\left( |2\eta(X) - 1| \leq h \right) = 0, \qquad (2)$$

then the same result holds with $\kappa = 1$ and $L(h) \propto h^{-1}$. More precisely, as
proved in [23]

Following the approach of Koltchinskii [16], we will consider the following
generalization of the margin condition:

$$\forall t \in S, \quad \ell(s,t) \geq \varphi\left( \sqrt{\mathrm{var}_P\left( \gamma(t;\cdot) - \gamma(s;\cdot) \right)} \right), \qquad (3)$$

where $S$ is the set of predictors, and $\varphi$ is a convex non-decreasing function on
$[0, \infty)$ with $\varphi(0) = 0$. Indeed, the proofs of the above upper bounds on the
prediction error of $\widehat{s}_m$ use only that (1) implies (3) with $\varphi(x) = L(C_0, \varepsilon_0, \alpha) x^{2\kappa}$
and $\kappa = (1 + \alpha)/\alpha$, and that (2) implies (3) with $\varphi(x) = h x^2$. (See, for instance,
Proposition 1 in [24].)
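
The implication from (2) to (3) with $\varphi(x) = hx^2$ can be checked by hand on a small example. The sketch below uses a hypothetical three-point distribution (the values of `p` and `eta` are assumptions made for illustration) and verifies, for every classifier $t$, that the excess loss is at least $h$ times the variance of $\gamma(t;\cdot) - \gamma(s;\cdot)$, where $h$ is the smallest value of $|2\eta - 1|$.

```python
from itertools import product

# Hypothetical finite problem: X takes 3 values with probabilities p[x],
# and eta[x] = P(Y = 1 | X = x). These numbers are illustrative only.
p = [0.5, 0.3, 0.2]
eta = [0.9, 0.2, 0.7]

bayes = [1 if e > 0.5 else 0 for e in eta]
h = min(abs(2 * e - 1) for e in eta)  # |2 eta(x) - 1| >= h for every x


def risk(t):
    """P gamma(t) = P(t(X) != Y) for the 0-1 loss."""
    return sum(px * ((1 - e) if tx == 1 else e) for px, e, tx in zip(p, eta, t))


def variance_of_loss_difference(t):
    """var_P(gamma(t; .) - gamma(s; .)) for the 0-1 loss, s being the Bayes predictor."""
    mean = 0.0
    second_moment = 0.0
    for px, e, tx, sx in zip(p, eta, t, bayes):
        for y, py in ((1, e), (0, 1 - e)):  # Y = 1 w.p. eta, Y = 0 w.p. 1 - eta
            d = int(tx != y) - int(sx != y)
            mean += px * py * d
            second_moment += px * py * d * d
    return second_moment - mean ** 2


# Check l(s, t) >= h * var for each of the 2^3 classifiers t.
checks = []
for t in product([0, 1], repeat=len(p)):
    excess = risk(t) - risk(bayes)
    checks.append(excess >= h * variance_of_loss_difference(t) - 1e-12)
```

The inequality holds for all eight classifiers because the excess loss equals $\mathbb{E}[|2\eta(X) - 1| \mathbb{1}_{t(X) \neq s(X)}]$, while the variance is at most $\mathbb{P}(t(X) \neq s(X))$.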

All these results show that the empirical risk minimizer is adaptive to the 
margin condition, since it leads to an optimal excess risk under various assump- 
tions on the complexity of S m . However, obtaining such rates of estimation 
requires knowledge of some S m to which the Bayes predictor belongs, which is 
a strong assumption. 

A less restrictive framework is the following. First, we do not assume that
$s \in S_m$. Second, we do not assume that the margin condition (3) is satisfied for
all $t \in S$, but only for $t \in S_m$, which can be seen as a "local" margin condition:

$$\forall t \in S_m, \quad \ell(s,t) \geq \varphi_m\left( \sqrt{\mathrm{var}_P\left( \gamma(t;\cdot) - \gamma(s;\cdot) \right)} \right), \qquad (4)$$

where $\varphi_m$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi_m(0) = 0$. The
fact that $\varphi_m$ can depend on $m$ allows situations where we are lucky to have a
strong margin condition for some small models but the global margin condition
is loose. As proven in Section 5.2 (Proposition 2), such situations certainly exist.

Note that when $\varphi_m(x) = h_m x^2$, (3) and (4) can be traced back to
mean-variance conditions on $\gamma$ which were used in several papers for deriving
convergence rates of some minimum contrast estimators on some given model $S_m$
(see for instance [TT] and references therein).

1.2. Adaptive model selection

Assume now that we are not given a single model but a whole family $(S_m)_{m \in \mathcal{M}_n}$.
By empirical risk minimization, we obtain a family $(\widehat{s}_m)_{m \in \mathcal{M}_n}$ of predictors,
from which we would like to select some $\widehat{s}_{\widehat{m}}$ with a prediction error $P\gamma(\widehat{s}_{\widehat{m}})$ as
small as possible. The aim of such a model selection procedure
$((X_1, Y_1), \ldots, (X_n, Y_n)) \mapsto \widehat{m} \in \mathcal{M}_n$ is to satisfy an oracle inequality of the form

$$\ell(s, \widehat{s}_{\widehat{m}}) \leq C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, s_m) + R_{m,n} \right\}, \qquad (5)$$

where the leading constant $C > 1$ should be close to one and the remainder term
$R_{m,n}$ should be close to the value $P\gamma(\widehat{s}_m) - P\gamma(s_m)$. Typically, one proves
that (5) holds either in expectation, or with high probability.

Assume for instance that $\varphi_m(x) = h_m x^2$ for some $h_m > 0$ and $S_m$ has a finite
VC-dimension $V_m > 1$. In view of the aforementioned minimax lower bounds
of [23], one cannot hope in general to prove an oracle inequality (5) with a
remainder $R_{m,n}$ smaller than

$$\min\left\{ \frac{\ln(n) V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\},$$

where the $\ln(n)$ term may only be necessary for some VC classes $S_m$ (see [23]).

Then, adaptive model selection occurs when $\widehat{m}$ satisfies an oracle inequality
(5) with $R_{m,n}$ of the order of this minimax lower bound. More generally, let
$C_m$ be some complexity measure of $S_m$ (for instance its VC-dimension, or the
$\rho$ appearing in Tsybakov's assumption [H]). Then, define $R_n(C_m, \varphi_m)$ as the
minimax prediction error over the set of distributions $P$ such that $s \in S_m$ and
the local margin condition (4) is satisfied in $S_m$ with $\varphi_m$, where $S_m$ has a
complexity at most $C_m$. Massart and Nédélec [23] have proven tight upper and
lower bounds on $R_n(C_m, \varphi_m)$ with several complexity measures; their results
are stated with the margin condition (3), but they actually use its local version
(4) only.

A margin adaptive model selection procedure should satisfy an oracle
inequality of the form

$$\ell(s, \widehat{s}_{\widehat{m}}) \leq C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, s_m) + R_n(C_m, \varphi_m) \right\} \qquad (6)$$

without using the knowledge of $C_m$ and $\varphi_m$. We call this property "strong margin
adaptivity", to emphasize the fact that this is more challenging than adaptivity
to a margin condition that holds uniformly over the models.



1.3. Penalization



We focus in particular in this article on penalization procedures, which are
defined as follows. Let $\mathrm{pen} : \mathcal{M}_n \to [0, \infty)$ be a (data-dependent) function,
and define

$$\widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\widehat{s}_m) + \mathrm{pen}(m) \right\}.$$

Since our goal is to minimize the prediction error of $\widehat{s}_{\widehat{m}}$, the ideal penalty would
be

$$\mathrm{pen}_{\mathrm{id}}(m) := P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m), \qquad (7)$$

but it is unknown because it depends on the distribution $P$. A classical way of
designing a penalty is to estimate $\mathrm{pen}_{\mathrm{id}}(m)$, or at least a tight upper bound on
it.
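
To make the role of the ideal penalty concrete, the following sketch selects a model by minimizing the penalized empirical risk. All numbers are hypothetical: in practice the true risks are unknown, and they are fixed here only to show that penalizing with (7) amounts to minimizing the true risk, whereas the unpenalized criterion favors the model with the smallest empirical risk.

```python
def select_model(empirical_risks, penalties):
    """Return an index m minimizing P_n gamma(s_hat_m) + pen(m)."""
    criterion = [r + pen for r, pen in zip(empirical_risks, penalties)]
    return min(range(len(criterion)), key=criterion.__getitem__)


# Hypothetical values for three models of increasing size (assumptions).
true_risks = [0.30, 0.22, 0.25]       # P gamma(s_hat_m), unknown in practice
empirical_risks = [0.28, 0.15, 0.10]  # P_n gamma(s_hat_m), optimistic for large models

# Ideal penalty (7): pen_id(m) = P gamma(s_hat_m) - P_n gamma(s_hat_m).
ideal_penalties = [t - e for t, e in zip(true_risks, empirical_risks)]

m_ideal = select_model(empirical_risks, ideal_penalties)
m_unpenalized = select_model(empirical_risks, [0.0, 0.0, 0.0])
```

With these numbers, the ideal penalty selects the second model (smallest true risk), while the unpenalized criterion selects the third one (smallest empirical risk).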

We consider in particular local complexity measures [201 HH1 HI H] 1 because 
they estimate pen id tightly enough to achieve fast estimation rates when the 
margin condition holds true. See Section 3.2 for a detailed definition of these
penalties.

1.4. Related results

There is a considerable literature on margin adaptivity, in the context of model
selection as well as model aggregation. Most of the papers consider the uniform
margin condition, that is when $\varphi_m \equiv \varphi$. Barron, Birgé and Massart [7] have
proven oracle inequalities for deterministic penalties, under some mean-variance
condition on $\gamma$ close to (3) with $\varphi(x) = hx^2$. Following a similar approach,
margin adaptive oracle inequalities (with more general $\varphi$) have been proven
with localized random penalties (2Ql UHl EJ [16] , and in [25] with other penalties 
in a particular framework. 

Adaptivity to the margin has also been considered with a regularized boosting 
method [T3], the hold-out [13] and in a PAC-Bayes framework [5]. Aggregation 
methods have been studied in [24] [17] . Notice also that a completely different 
approach is possible: estimate first the regression function $\eta$ (possibly through
model selection), then use a plug-in classifier, and this works provided $\eta$ is
smooth enough [6].

It is quite unclear whether any of these results can be extended to strong
margin adaptivity (actually, we will prove that this needs additional restrictions
in general). To our knowledge, the only results allowing $\varphi_m$ to depend on $m$ can
be found in [16]. First, when the models are nested, a comparison method based
on local Rademacher complexities attains strong margin adaptivity, assuming
that $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ (Theorem 7; and it is quite unclear whether this still holds
without the latter assumption). Second, a penalization method based on local
Rademacher complexities has the same property in the general case, but it uses
the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$ (Theorems 6 and 11).

Our claim is that when $\varphi_m$ does strongly depend on $m$, it is crucial to take
it into account to choose the best model in $\mathcal{M}_n$. And such situations occur, as
proven by our Proposition 2 in Section 5.2. But assuming either $s \in \bigcup_{m \in \mathcal{M}_n} S_m$
or that $\varphi_m$ is known is not realistic. Our goal is to investigate the kind of results
which can be obtained with completely data-driven procedures, in particular
when $s \notin \bigcup_{m \in \mathcal{M}_n} S_m$.

1.5. Our results

In this paper, we aim at understanding when strong margin adaptivity can
be obtained for data-dependent model selection procedures. Notice that we do
not restrict ourselves to the classification setting. We consider a much more
general framework (as in [16] for instance), which is described in Section 2. We
prove two kinds of results. First, when models are nested, we show that some
penalization methods are strongly margin adaptive (Theorem 1). In particular,
this result holds for the local Rademacher complexities (Corollary 1). Compared
to previous results (in particular the ones of [16]), our main advance is that our
penalties do not require the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$, and we do not assume
that the Bayes predictor belongs to any of the models.

Our second result probes the limits of strong margin adaptivity, without the
nested assumption. A family of models exists such that, for every sample size
$n$ and every (model) selection procedure $\widehat{m}$, a distribution $P$ exists for which
$\widehat{m}$ fails to be strongly margin adaptive with a positive probability (Theorem 2).
Hence, the previous positive results (Theorem 1 and Corollary 1) cannot be
extended outside of the nested case for a general distribution $P$.

Where is the boundary between these two extremes? Obviously, the nested
assumption is not necessary. For instance, when the global margin assumption
is indeed tight ($\varphi \equiv \varphi_m$ for every $m \in \mathcal{M}_n$), margin adaptivity can be obtained
in several ways, as mentioned in Section 1.4. We sketch in Section 5 some
situations where strong margin adaptivity is possible. More precisely, we state a
general oracle inequality (Theorem 3), valid for any family of models and any
distribution $P$. We then discuss assumptions under which its remainder term is
small enough to imply strong margin adaptivity.

This paper is organized as follows. We describe the general setting in
Section 2. We consider in Section 3 the nested case, in which strong margin
adaptivity holds. Negative results (i.e., lower bounds on the prediction error of a general
model selection procedure) are stated in Section 4. The line between these two
situations is sketched in Section 5. We discuss our results in Section 6. All the
proofs are given in Section 7.

2. The general empirical minimization framework 

Although our main motivation comes from the classification problem, it turns 
out that all our results can be proven in the general setting of empirical min- 
imization. As explained below, this setting includes binary classification with 
the 0-1 loss, bounded regression and several other frameworks. In the rest of 
the paper, we will use the following general notation, in order to emphasize the 
generality of our results. 

We observe independent realizations $\xi_1, \ldots, \xi_n \in \Xi$ of a random variable with
distribution $P$, and we are given a set $\mathcal{F}$ of measurable functions $\Xi \to [0,1]$.
Our goal is to build some (data-dependent) $\widehat{f}$ such that its expectation $P(\widehat{f}) :=
\mathbb{E}_{\xi \sim P}[\widehat{f}(\xi)]$ is as small as possible. For the sake of simplicity, we assume that
there is a minimizer $f^*$ of $P(f)$ over $\mathcal{F}$.

This includes the prediction framework, in which $\Xi = \mathcal{X} \times \mathcal{Y}$, $\xi_i = (X_i, Y_i)$,

$$\mathcal{F} := \left\{ \xi \mapsto \gamma(t; \xi) \ \text{s.t.}\ t \in S \right\},$$

where $\gamma : S \times \Xi \to [0, 1]$ is any contrast function. Then, $f^*$ is equal to $\gamma(s;\cdot)$,
where $s$ is the Bayes predictor. In the binary classification framework, $\mathcal{Y} =
\{0,1\}$ and we can take the 0-1 contrast $\gamma(t; (x, y)) = \mathbb{1}_{t(x) \neq y}$ for instance.
We then recover the setting described in Section 1. In the bounded regression
framework, assuming that $\mathcal{Y} = [0, 1]$, we can take the least-squares contrast,

$$\gamma(t; (x,y)) = (t(x) - y)^2.$$

Many other contrast functions $\gamma$ can be considered, provided that they take
their values in $[0, 1]$. Notice the one-to-one correspondence between predictors $t$
and functions $f_t := \gamma(t; \cdot)$ in the prediction framework.

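The empirical minimizer over $\mathcal{F}_m \subset \mathcal{F}$ (called a model) can then be defined
as

$$\widehat{f}_m \in \arg\min_{f \in \mathcal{F}_m} P_n(f).$$

We expect that its expectation $P(\widehat{f}_m)$ is close to that of $f_m \in \arg\min_{f \in \mathcal{F}_m} P(f)$,
assuming that such a minimizer exists. In the prediction framework, defining
$\mathcal{F}_m := \{ f_t \ \text{s.t.}\ t \in S_m \}$, we have $\widehat{f}_m = f_{\widehat{s}_m}$ and $f_m = f_{s_m}$.
We can now write the global margin condition as follows:

$$\forall f \in \mathcal{F}, \quad P(f - f^*) \geq \varphi\left( \sqrt{\mathrm{var}_P(f - f^*)} \right), \qquad (8)$$

where $\varphi$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi(0) = 0$. Similarly,
the local margin condition is

$$\forall f \in \mathcal{F}_m, \quad P(f - f^*) \geq \varphi_m\left( \sqrt{\mathrm{var}_P(f - f^*)} \right). \qquad (9)$$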

Notice that most of the upper and lower bounds on the risk under the margin
condition given in the introduction stay valid in the general empirical
minimization framework, at least when $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$ and $\kappa_m > 1$
(see for instance [23] [H]). Assume that $\mathcal{F}_m$ is a VC-type class of dimension $V_m$.
If $\varphi_m(x) = h_m x^2$,

$$\mathbb{E}\left[ P(\widehat{f}_m - f^*) \right] \leq 2 P(f_m - f^*) + C \min\left\{ \frac{\ln(n) V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\}$$

for some numerical constant $C > 0$. If $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$
and $\kappa_m > 1$,

$$\mathbb{E}\left[ P(\widehat{f}_m - f^*) \right] \leq 2 P(f_m - f^*) + C \min\left\{ L(h_m, \kappa_m) \ln(n) \left( \frac{V_m}{n} \right)^{\kappa_m/(2\kappa_m - 1)}, \sqrt{\frac{V_m}{n}} \right\}$$

for some constant $L(h_m, \kappa_m) > 0$.

Given a collection $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ of models, we are looking for a model selection
procedure $(\xi_1, \ldots, \xi_n) \mapsto \widehat{m} \in \mathcal{M}_n$ satisfying an oracle inequality of the form

$$P(\widehat{f}_{\widehat{m}} - f^*) \leq C \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + R_{m,n} \right\}, \qquad (10)$$

with a leading constant $C$ close to 1 and a remainder term $R_{m,n}$ as small as
possible. Similarly to (6), we define a strongly margin adaptive procedure as
any $\widehat{m}$ such that (10) holds with some numerical constant $C$, and $R_{m,n}$ of the
order of the minimax risk $R_n(C_m, \varphi_m)$.
Defining penalization methods as

$$\widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n(\widehat{f}_m) + \mathrm{pen}(m) \right\} \qquad (11)$$

for some data-dependent $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}$, the ideal penalty is
$\mathrm{pen}_{\mathrm{id}}(m) := (P - P_n)(\widehat{f}_m)$.



3. Margin adaptive model selection for nested models 
3.1. General result 

Our first result is a sufficient condition for penalization procedures to attain
strong margin adaptivity when the models are nested (Theorem 1). Since this
condition is satisfied by local Rademacher complexities, this leads to a
data-driven margin adaptive penalization procedure (Corollary 1).

Theorem 1. Fix $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ and $(\varphi_m)_{m \in \mathcal{M}_n}$ such that the local margin
conditions (9) hold. Let $(t_m)_{m \in \mathcal{M}_n}$ be a sequence of positive reals that is
nondecreasing (with respect to the inclusion ordering on $\mathcal{F}_m$). Assume that some constants
$c, \eta \in (0, 1)$ and $C_1, C_2 > 0$ exist such that the following holds:

• the models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested.

• lower bounds on the penalty: with probability at least $1 - \eta$, for every
$m, m' \in \mathcal{M}_n$,

(1 - c) pen(m) > (P - P n ) [f m - / m ) + ^ > (12) 

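$$\mathcal{F}_{m'} \subset \mathcal{F}_m \implies c\, \mathrm{pen}(m) \geq v(m) - C_1 v(m') - C_2 P(f_{m'} - f^*), \qquad (13)$$

where $v(m) := \sqrt{\frac{2 t_m}{n} \mathrm{var}_P(f_m - f^*)}$.

Then, if $\widehat{m}$ is defined by (11), with probability at least $1 - \eta - \sum_{m \in \mathcal{M}_n} e^{-t_m}$,
we have for every $\varepsilon \in (0, 1)$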
we have for every e £ (0, 1) 

P(ffh-f) < j^— mf {(l + £ + C 2 + £Ci)P(/ m -/*)+pen(m) 

(14) 

1 + max { 1, Ci } ) min t ip* m 

where $\varphi_m^*(x) := \sup_{y \geq 0} \{ xy - \varphi_m(y) \}$ is the convex conjugate of $\varphi_m$.

Theorem 1 is proved in Section 7.1.

Remark 1. 1. If $\mathrm{pen}(m)$ is of the right order, i.e., not much larger than
$\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$, then Theorem 1 is a strong margin adaptivity result.
Indeed, assuming that $\varphi_m(x) = (h_m x^2)^{\kappa_m}$, the remainder term is not too
large, since

$$\varphi_m^*(x) = L(h_m, \kappa_m)\, x^{2\kappa_m/(2\kappa_m - 1)}$$

for some positive constant $L(h_m, \kappa_m)$. Hence, choosing $\varepsilon = 1/2$ for
instance, we can rewrite (14) as



< L(C\,C2) inf < P (f m — /*) + pen(m) + L 

mEM„ 

for some positive constants $L(C_1, C_2)$ and $L(h_m, \kappa_m)$. When $\varphi_m$ is a
general convex function, minimax estimation rates are no longer available, so
that we do not know whether the remainder term in (14) is of the right
order. Nevertheless, no better risk bound is known, even for a single model
to which $s$ belongs.
2. In the case that the $\varphi_m$ are known, methods involving local Rademacher
complexities and $(\varphi_m)_{m \in \mathcal{M}_n}$ satisfy oracle inequalities similar to (14) (see
Theorems 6 and 11 in [16]). On the contrary, the $\varphi_m$ are not assumed to
be known in Theorem 1, and conditions (12) and (13) are satisfied by
completely data-dependent penalties, as shown in Section 3.2.
Also, Theorem 7 of [16] shows that adaptivity is possible using a
comparison method, provided that $f^*$ belongs to one of the models. However, it
is not clear whether this comparison method achieves the optimal
bias-variance trade-off in the general case, as in Theorem 1.
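
The power-law form of the convex conjugate used in Remark 1 can be checked numerically. For $\varphi(y) = a y^p$ with $a = h^{\kappa}$ and $p = 2\kappa$, maximizing $xy - a y^p$ over $y \geq 0$ gives $\varphi^*(x) = (1 - 1/p)(ap)^{-1/(p-1)} x^{p/(p-1)}$, whose exponent is $2\kappa/(2\kappa - 1)$. The sketch below compares this closed form (our own derivation, in which the constant plays the role of $L(h_m, \kappa_m)$) with a brute-force supremum on a grid; the grid bounds are arbitrary choices.

```python
def phi(y, h, kappa):
    """Margin function phi(y) = (h * y**2) ** kappa."""
    return (h * y * y) ** kappa


def conjugate_closed_form(x, h, kappa):
    """Closed form of sup_{y >= 0} { x*y - phi(y) }, writing phi(y) = a * y**p."""
    a, p = h ** kappa, 2.0 * kappa
    return (1.0 - 1.0 / p) * (a * p) ** (-1.0 / (p - 1.0)) * x ** (p / (p - 1.0))


def conjugate_grid(x, h, kappa, y_max=10.0, steps=100_000):
    """Brute-force approximation of the supremum over a grid of y in [0, y_max]."""
    best = 0.0  # value at y = 0
    for i in range(steps + 1):
        y = y_max * i / steps
        best = max(best, x * y - phi(y, h, kappa))
    return best


# Example values (assumptions): h = 2, x = 1, kappa in {1, 1.5}.
quadratic_case = conjugate_closed_form(1.0, 2.0, 1.0)  # equals x**2 / (4 * h)
gap = abs(conjugate_grid(1.0, 2.0, 1.5) - conjugate_closed_form(1.0, 2.0, 1.5))
```

For $\kappa = 1$ the closed form reduces to $x^2/(4h)$, so a remainder term of the form $\varphi_m^*(\sqrt{t/n})$ scales as $t/(n h_m)$, which is the fast rate discussed in the introduction.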




3.2. Local Rademacher complexities 

Although Theorem 1 applies to any penalization procedure satisfying
assumptions (12) and (13), we now focus on methods based on local Rademacher
complexities. Let us define precisely these complexities. We mainly use the notation
of [16]:

• for every $\delta > 0$, the $\delta$-minimal set of $\mathcal{F}_m$ w.r.t. the distribution $P$ is

$$\mathcal{F}_{m,P}(\delta) := \left\{ f \in \mathcal{F}_m \ \text{s.t.}\ P(f) - \inf_{g \in \mathcal{F}_m} P(g) \leq \delta \right\}$$

• the $L^2(P)$ diameter of the $\delta$-minimal set of $\mathcal{F}_m$:

$$D_P^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \mathcal{F}_{m,P}(\delta)} P\left( (f - g)^2 \right)$$

• the expected modulus of continuity of $(P - P_n)$ over $\mathcal{F}_m$:

$$\phi_n(\mathcal{F}_m; P; \delta) = \mathbb{E} \sup_{f, g \in \mathcal{F}_{m,P}(\delta)} \left| (P_n - P)(f - g) \right|.$$

We then define

$$U_n(\mathcal{F}_m; \delta; t) := K \left( \phi_n(\mathcal{F}_m; P; \delta) + D_P(\mathcal{F}_m; \delta) \sqrt{\frac{t}{n}} + \frac{t}{n} \right),$$

where $K > 0$ is a numerical constant (to be chosen later). The (ideal)
local complexity $\bar{\delta}_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed-point of
$r \mapsto U_n(\mathcal{F}_m; r; t)$. More precisely,

6 n (Fm; t) := inf | S > s.t. sup j U ^ a ^ J < 1 J (i 5 ) 

where q > 1 is a numerical constant. 

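Two important points, which follow from Theorems 1 and 3 of Koltchinskii
[16], are that:

1. $\bar{\delta}_n(\mathcal{F}_m; t)$ is large enough to satisfy assumption (12) with a probability at
least $1 - \log_q(n/t)\, e^{-t}$ for each model $m \in \mathcal{M}_n$.

2. there is a completely data-dependent $\widehat{\delta}_n(\mathcal{F}_m; t)$ such that

$$\forall m \in \mathcal{M}_n, \quad \mathbb{P}\left( \widehat{\delta}_n(\mathcal{F}_m; t) \geq \bar{\delta}_n(\mathcal{F}_m; t) \right) \geq 1 - 5 \log_q\left( \frac{n}{t} \right) e^{-t}.$$

This data-dependent $\widehat{\delta}_n(\mathcal{F}_m; t)$ is a resampling estimate of $\bar{\delta}_n(\mathcal{F}_m; t)$, called
the "local Rademacher complexity".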

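Before stating the main result of this section, let us recall the definition of
$\widehat{\delta}_n(\mathcal{F}_m; t)$, as in [16]. We need the following additional notation:

• for every $\delta > 0$, the empirical $\delta$-minimal set of $\mathcal{F}_m$ is

$$\widehat{\mathcal{F}}_{n,m}(\delta) := \left\{ f \in \mathcal{F}_m \ \text{s.t.}\ P_n(f) - \inf_{g \in \mathcal{F}_m} P_n(g) \leq \delta \right\} = \mathcal{F}_{m,P_n}(\delta)$$

• the empirical $L^2(P)$ diameter of the empirical $\delta$-minimal set of $\mathcal{F}_m$:

$$\widehat{D}_n^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \widehat{\mathcal{F}}_{n,m}(\delta)} P_n\left( (f - g)^2 \right)$$

• the modulus of continuity of the Rademacher process $f \mapsto n^{-1} \sum_{i=1}^n \varepsilon_i f(\xi_i)$
over $\mathcal{F}_m$, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher random variables (i.e., $\varepsilon_i$
takes the values $+1$ and $-1$ with probability $1/2$ each):

$$\widehat{\phi}_n(\mathcal{F}_m; \delta) = \sup_{f, g \in \widehat{\mathcal{F}}_{n,m}(\delta)} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \left( f(\xi_i) - g(\xi_i) \right) \right|.$$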
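
For a finite model, the three empirical quantities above can be computed directly from the matrix of losses. The following is a minimal sketch under that assumption: `losses[k][i]` stands for the value of the k-th function of the model at $\xi_i$, and the Rademacher signs are passed explicitly (in practice they would be drawn at random). The toy values are hypothetical.

```python
from typing import List, Sequence


def empirical_mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def empirical_minimal_set(losses: List[List[float]], delta: float) -> List[int]:
    """Indices k such that P_n(f_k) - min_g P_n(g) <= delta."""
    means = [empirical_mean(row) for row in losses]
    best = min(means)
    return [k for k, m in enumerate(means) if m - best <= delta]


def empirical_diameter_sq(losses: List[List[float]], delta: float) -> float:
    """Sup of P_n((f - g)^2) over the empirical delta-minimal set."""
    idx = empirical_minimal_set(losses, delta)
    return max(
        empirical_mean([(a - b) ** 2 for a, b in zip(losses[f], losses[g])])
        for f in idx for g in idx
    )


def rademacher_modulus(losses: List[List[float]], delta: float,
                       signs: Sequence[int]) -> float:
    """Sup of |n^-1 sum_i eps_i (f - g)(xi_i)| over the empirical delta-minimal set."""
    idx = empirical_minimal_set(losses, delta)
    n = len(signs)
    return max(
        abs(sum(e * (a - b) for e, a, b in zip(signs, losses[f], losses[g]))) / n
        for f in idx for g in idx
    )


# Hypothetical 0-1 losses of three functions on n = 6 points, and fixed signs.
losses = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]
signs = [1, -1, 1, 1, -1, -1]
```

Both suprema are taken over a set that grows with $\delta$, so they are nondecreasing in $\delta$ for fixed data and signs. The local Rademacher complexity combines them through a fixed-point equation, which this sketch does not implement.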

Defining

$$\widehat{U}_n(\mathcal{F}_m; \delta; t) := \widehat{K} \left( \widehat{\phi}_n(\mathcal{F}_m; \widehat{c}\delta) + \widehat{D}_n(\mathcal{F}_m; \widehat{c}\delta) \sqrt{\frac{t}{n}} + \frac{t}{n} \right)$$

(where $\widehat{K}, \widehat{c} > 0$ are numerical constants, to be chosen later), the local Rademacher
complexity $\widehat{\delta}_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed-point of $r \mapsto \widehat{U}_n(\mathcal{F}_m; r; t)$.
More precisely,

? n (J- ro ; t) := inf j 5 > s.t. sup j M^i^) | < ^ | (16) 
where q > 1 is a numerical constant. 



Corollary 1 (Strong margin adaptivity for local Rademacher complexities).
There exist numerical constants $K > 0$ and $q > 1$ such that the following holds.
Let $t > 0$. Assume that a numerical constant $L > 0$ exists and an event of
probability at least $1 - L \log_q(n/t)\, \mathrm{Card}(\mathcal{M}_n)\, e^{-t}$ exists on which
7- 

VmeM n , pen(m) > -S n {T m \t) , (17) 

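where $\bar{\delta}_n(\mathcal{F}_m; t)$ is defined by (15) (and depends on both $K$ and $q$). Assume
moreover that the models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested and

$$\widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n(\widehat{f}_m) + \mathrm{pen}(m) \right\}.$$

Then, an event of probability at least $1 - \left[ 2 + (L + 1) \log_q\left( \frac{n}{t} \right) \right] \mathrm{Card}(\mathcal{M}_n)\, e^{-t}$
exists on which, for every $\varepsilon \in (0, 1)$,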
p(4- r )<^^{(l + J- +E (l + ^))p(/ m -r )+P enH 

(18) 

In particular, this holds when pen(m) = ■kS n {J : m )t), provided that K,c> are 
larger than some constants depending only on K,q. 

Corollary 1 is proved in Section 7.1.

Remark 2. One can always enlarge the constants $K$ and $q$, making the leading
constant of the oracle inequality (18) closer to one, at the price of enlarging
$\bar{\delta}_n(\mathcal{F}_m; t)$ (hence $\mathrm{pen}(m)$ or $\widehat{\delta}_n(\mathcal{F}_m; t)$). We do not know whether it is possible
to make the leading constant closer to one without changing the penalization
procedure itself.

As we show in Section 5.2, there are distributions $P$ and collections of models
$(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ such that this is a strong improvement over the "uniform margin"
case, in terms of prediction error. It seems reasonable to expect that this happens
in a significant number of practical situations.

In Section 5, we state a more general result (from which Theorem 1 is a
corollary) which suggests why it is more difficult to prove Corollary 1 when $\varphi_m$
really depends on $m$. This general result is also useful to understand how the
nestedness assumption might be relaxed in Theorem 1 and Corollary 1.



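The reason why Corollary 1 implies strong margin adaptivity is that the local
Rademacher complexities are not too large when the local margin condition is
satisfied, together with a complexity assumption on $\mathcal{F}_m$. Indeed, there exists
a distribution-dependent $\widetilde{\delta}_n(\mathcal{F}_m; t)$ (defined as $\bar{\delta}_n(\mathcal{F}_m; t)$ with $U_n(\mathcal{F}_m; \delta; t)$
replaced by $K_1 U_n(\mathcal{F}_m; K_2 \delta; t)$ for some numerical constants $K_1, K_2 > 0$, related
to $\widehat{K}$ and $\widehat{c}$) such that

$$\forall m \in \mathcal{M}_n, \quad \mathbb{P}\left( \widetilde{\delta}_n(\mathcal{F}_m; t) \geq \widehat{\delta}_n(\mathcal{F}_m; t) \geq \bar{\delta}_n(\mathcal{F}_m; t) \right) \geq 1 - 5 \log_q\left( \frac{n}{t} \right) e^{-t}.$$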



(See Theorem 3 of [16].) This leads to several upper bounds on $\widehat{\delta}_n(\mathcal{F}_m; t)$ under
the local margin condition (9), by combining Lemma 5 of [16] with the examples
of its Section 2.5. For instance, in the binary classification case, when $\mathcal{F}_m$ is the
class of 0-1 loss functions associated with a VC-class $S_m$ of dimension $V_m$, such
that the margin condition (9) holds with $\varphi_m(x) = h_m x^2$, we have for every $t > 0$
and $\varepsilon \in (0, 1]$,



6n(? m ;t) <eP(f m -f*) + 



nh„ 



e-H 



r 2 v m In 



ne 2 h rl 

KiV„, 



(19) 



where $K_3$ and $K_4$ depend only on $K$. (Similar upper bounds hold under several
other complexity assumptions on the models $\mathcal{F}_m$, see [16].) In particular, when
each model $S_m$ is a VC-class of dimension $V_m$, $\mathrm{pen}(m)$ is the multiple of
$\widehat{\delta}_n(\mathcal{F}_m; t)$ considered in Corollary 1 and $t = \ln(\mathrm{Card}(\mathcal{M}_n)) + 3 \ln(n)$,
combining (18) and (19) implies that

$$P(\widehat{f}_{\widehat{m}} - f^*) \leq C \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + \frac{\ln(\mathrm{Card}(\mathcal{M}_n)) + \ln(n) + V_m \ln(e n h_m / V_m)}{n h_m} \right\}$$

with probability at least $1 - K n^{-2}$, for some numerical constants $C, K > 0$. Up
to some $\ln(n)$ factor, this is a strong margin adaptive model selection result,
provided that $\mathrm{Card}(\mathcal{M}_n)$ is smaller than some power of $n$. Notice that the $\ln(n)$
factor is sometimes necessary (as shown by [23]), meaning that this upper bound
is then optimal.



4. Lower bound for some non-nested models 

In this section, we investigate the assumption in Theorem 1 that the models
$\mathcal{F}_m$ are nested. To this aim, let us consider the case where models are singletons
$\mathcal{F}_m = \{f_m\}$. Then, any estimator $\widehat{f}_m \in \mathcal{F}_m$ is deterministic and equal to $f_m$, so
that model selection amounts to selecting among a family $\{ f_m \ \text{s.t.}\ m \in \mathcal{M}_n \}$ of
functions. Theorem 2 below shows that no selection procedure can be strongly
margin adaptive in general.

Theorem 2. Let $\gamma$ be the 0-1 loss and $\mathcal{F}^{0-1} := \{ \gamma(u; \cdot) \ \text{s.t.}\ u : \mathcal{X} \to \{0, 1\} \ \text{is measurable} \}$
be the associated loss function class. If $\mathrm{Card}(\mathcal{X}) > 2$, two functions $f_0, f_1 \in
\mathcal{F}^{0-1}$ and absolute constants $C_3, C_4 > 0$ exist such that the following holds. For
every integer $n > 2$ and $\widehat{m}$ a selection procedure (that is, a function $(\mathcal{X} \times \mathcal{Y})^n \to
\mathcal{M} = \{0, 1\}$), a distribution $P$ exists such that

P{M-n>T^- rmn \p(f m -n+v(m) + l ^-\)>C a 
m(n) me{o,i} L nh m J / 



(20) 

ln(n) me{0,l} |_ v / v / 7l/l m J 

(21) 



and E[P -/*)]> C f*^ min ( P (/ m - /*) +v(m) + lu( " 



where Vm G {0, 1} , v(m) := \ — - varp(/ m — /*) anc? h n ■ ''' "' ' 



var P (/ m - /*) 



Theorem 2 is proved in Section 7.2. A straightforward corollary of Theorem 2
is that in the classification setting with the 0-1 loss, strong margin adaptive
model selection is not always possible when the models are not nested. Indeed,
when $\mathcal{F}_m = \{f_m\}$ for every $m \in \mathcal{M}_n = \{0, 1\}$, (20) shows that for any model
selection procedure $\widehat{m}$, some distribution $P$ exists such that results like Theorem 1
or Corollary 1 do not hold if $t_m = \ln(n)$ for every $m$.

Remark 3. 1. Theorem 2 (and its corollary for model selection) also hold for
randomized rules $\widehat{m} : (\mathcal{X} \times \mathcal{Y})^n \to [0, 1]$ (where the value of $\widehat{m}((X_i, Y_i)_{1 \leq i \leq n})$
is the probability assigned to the choice of $f_1$). Hence, aggregating models
instead of selecting one does not modify the conclusion of Theorem 2.

2. The most reasonable selection procedure among two functions $f_0$ and
$f_1$ (or two models $\{f_0\}$ and $\{f_1\}$) clearly is empirical minimization.
The proof of Theorem 2 yields explicitly some distribution $P$, called $P_1$,
such that (20) and (21) hold for empirical minimization. Note that when
models are singletons, most penalization procedures coincide with
empirical minimization, for instance when $\mathrm{pen}(m)$ is proportional to the local
Rademacher complexity $\widehat{\delta}_n(\mathcal{F}_m; t)$, or to the ideal penalty $\mathrm{pen}_{\mathrm{id}}(m) =
(P - P_n)(\widehat{f}_m - f_m)$, its expectation or some quantile of $\mathrm{pen}_{\mathrm{id}}(m)$.

3. Theorem 2 focuses on margin adaptivity with $\varphi_m(x) = h_m x^2$, whereas
the margin condition is also satisfied with other functions $\varphi_m$. This is
both for simplicity reasons, and because this choice emphasizes that one
could hope for learning rates of order $1/(n h_m)$ if strong margin adaptivity
were possible. The meaning of Theorem 2 is then mainly that one cannot
guarantee to learn at a rate better than $1/\sqrt{n}$, whereas for some model,
the excess loss and $1/(n h_m)$ both are of order $1/n$.

4. The counterexample given in the proof of Theorem 2 is highly
nonasymptotic, since the distribution $P$ strongly depends on $n$. If $P$ and $f_0, f_1$
were fixed, it is well known that empirical minimization leads to
asymptotic optimality, because $(f_m)_{m \in \{0,1\}}$ is finite and fixed when $n$ grows.
This illustrates a significant difference between the asymptotic and
non-asymptotic frameworks.

Another example of such a difference occurs when the number of
candidate functions (or models) is infinite, or grows to infinity with the sample
size, see (iv) in Proposition 2 in Section 5.2.



With Theorem 1, we have proven a strong margin adaptivity result for nested
models, which holds true when the penalty is built upon local Rademacher
complexities. Therefore, adaptive model selection is attainable for nested models,
whatever the distribution of the data. On the other hand, Theorem 2 gives a
simple example where no model selection procedure can satisfy an oracle inequality
(10) with a leading constant smaller than $C_3 \sqrt{n} / \ln(n)$.

Looking carefully at the selection problems considered in the proof of
Theorem 2, it appears that the main reason why they are particularly tough is that
we are quite "lucky" with one of the models: it has simultaneously a very small
bias, a very small size and a large margin parameter, while other models with
very similar appearance are much worse. When looking for more general strong
margin adaptivity results, we then must keep in mind that this is a hopeless task
in such situations.



Let us finally mention a related result in a close but slightly different
framework. In the classification framework, under a global margin condition with
$\varphi(x) \propto x^{2\kappa}$ with $\kappa > 1$, Theorem 3 in [18] shows that for any $M_n > 2$, a family
$(u_m)_{m \in \mathcal{M}_n}$ of $M_n$ classifiers exists for which, for any selection procedure $\widehat{m}$,
some distribution $P$ exists such that

E[P(/a -/*)]> mf {P(f m -f*)} + c( 

where $f_m = \gamma(u_m; \cdot)$ for some loss function $\gamma$. When $\widehat{m}$ is (penalized) empirical
minimization, the remainder term is shown to be as large as $C \sqrt{\ln(M_n)/n}$ when
the margin condition holds with $\kappa > 1$.

This result and Theorem 2 focus on different problems. In [18], the margin
condition is only assumed to hold globally, and the focus is on the dependence of
the remainder term on the cardinality $M_n$ of $\mathcal{M}_n$. Therefore, the
counterexample given in [18] implies nothing about local margin conditions for $(f_m)_{m \in \mathcal{M}_n}$.
Note that using these arguments, we could probably generalize Theorem 2 to a
family of $M_n > 2$ functions and obtain a lower bound depending on $M_n$ as in

5. General collections of models

As proven in Section [4], we cannot hope to obtain margin adaptivity without any
assumption on either $P$ or the models. The purpose of this section is to explain
what can still be proven in the general case, and why this is weaker than our
Theorem [1].



5.1. A general oracle inequality 

We start with a general result for penalties satisfying the lower bound (12).

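Theorem 3. Let $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ be any countable family of models, and $(t_m)_{m \in \mathcal{M}_n}$
be any sequence of positive numbers. Let $\widehat{m}$ be defined as a minimizer over $m \in \mathcal{M}_n$ of the
penalized empirical criterion $P_n(\widehat{f}_m) + \mathrm{pen}(m)$, and assume that
some $c \in (0, 1)$ exists such that

$$\forall m \in \mathcal{M}_n , \quad (1 - c)\, \mathrm{pen}(m) \geq (P - P_n)(\widehat{f}_m - f_m) + \frac{t_m}{3n} \geq 0 \tag{22}$$

on an event of probability at least $1 - \eta$.

Then, there exists an event of probability at least $1 - \eta - 2 \sum_{m \in \mathcal{M}_n} e^{-t_m}$ on
which the following holds: for every $\varepsilon \in (0, 1)$,

$$P(\widehat{f}_{\widehat{m}} - f^*) \leq \frac{1}{1 - \varepsilon} \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + \mathrm{pen}(m) + v(m) + \frac{t_m}{3n} \right\} + V_n \tag{23}$$

where

$$V_n := \frac{1}{1 - \varepsilon} \sup_{m \in \mathcal{M}_n} \left\{ v(m) - \varepsilon P(f_m - f^*) - c\, \mathrm{pen}(m) \right\}
\quad \text{and} \quad
v(m) := \sqrt{\frac{2 t_m}{n} \mathrm{var}_P(f_m - f^*)} .$$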

Theorem [3] is proved in Section [7.1]. Let us make a few comments.

First, without $V_n$, (23) is the kind of oracle inequality we are looking for,
since the leading constant is close to 1 (provided $\varepsilon$ is small enough). For the
sake of simplicity, assume that a margin condition (9) holds for every model
$m \in \mathcal{M}_n$ with $\varphi_m(x) = h_m x^2$. Then,

$$v(m) \leq \sqrt{\frac{2 t_m P(f_m - f^*)}{n h_m}} \leq \varepsilon P(f_m - f^*) + \frac{t_m}{2 \varepsilon h_m n}$$

for any $\varepsilon \in (0, 1)$. Hence, the first term of the right-hand side of (23) is smaller
than

$$\frac{1 + \varepsilon}{1 - \varepsilon} \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + \mathrm{pen}(m) + \frac{t_m}{2 \varepsilon h_m n} + \frac{t_m}{3n} \right\}$$

which is the right-hand side of a margin adaptive oracle inequality like (10) (at
least when the penalty is itself of the right order). A similar result holds for a
more general $\varphi_m$; see the proof of Theorem [1].
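
The two-step bound on $v(m)$ above can be checked numerically. The following is a minimal sketch, assuming $v(m) = \sqrt{2 t_m \mathrm{var}_P(f_m - f^*)/n}$ and a margin condition of the form $h_m \mathrm{var}_P(f_m - f^*) \leq P(f_m - f^*)$; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def variance_term(t_m, var, n):
    """v(m) = sqrt(2 * t_m * var_P(f_m - f*) / n)."""
    return math.sqrt(2.0 * t_m * var / n)


def margin_upper_bound(t_m, excess, h_m, n, eps):
    """Upper bound eps * P(f_m - f*) + t_m / (2 * eps * h_m * n)."""
    return eps * excess + t_m / (2.0 * eps * h_m * n)


# Hypothetical values (not taken from the paper).
n = 1000
t_m = math.log(n)
h_m = 0.5
excess = 0.01
# Largest variance compatible with the assumed margin condition h_m * var <= excess.
var = excess / h_m

for eps in (0.1, 0.5, 0.9):
    lhs = variance_term(t_m, var, n)
    rhs = margin_upper_bound(t_m, excess, h_m, n, eps)
    print(f"eps={eps}: v(m)={lhs:.5f}, bound={rhs:.5f}, holds={lhs <= rhs}")
```

The bound holds for every $\varepsilon$ because it is an instance of $2\sqrt{ab} \leq a + b$ with $a = \varepsilon P(f_m - f^*)$ and $b = t_m / (2 \varepsilon h_m n)$.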
Once we have a penalty satisfying (22) (for instance, a local Rademacher
penalty), the main difficulty for proving a strong margin adaptivity result then
lies in $V_n$. It arises from the difference between the ideal penalty and the
right-hand side of the lower bound (22), that is $(P - P_n)(f_m)$. This random quantity
is centered, and (up to a quantity independent of $m$) has deviations of order
$v(m)$, Bernstein's inequality being unimprovable. Then, if $v(m)$ happens to be
much larger than $P(f_m - f^*) + \mathrm{pen}(m)$, $m$ is selected with a positive probability,
whatever the value of $P(f_m - f^*)$. In that case, the expectation of $f_{\widehat{m}}$ is worse
than the oracle by at least $v(m)$ (for any of these "bad" models). Hence, $V_n$
certainly is unavoidable in (23).

As shown by Theorem [2], $V_n$ can be much larger than the expectation of a
strong margin adaptive estimator. Nevertheless, $V_n$ is not always the main term
in the right-hand side of (23). Let us now describe a set of favorable situations,
in which it is possible to prove that $V_n$ is small enough.

1. Models are nested, $t_m$ is nondecreasing (with respect to the inclusion
ordering on $\mathcal{F}_m$), and pen satisfies the additional condition (13): see
Section [3].

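2. Models are nested, $t_m$ is nondecreasing and $v(m)$ is decreasing (or at least
not increasing too much) when $\mathcal{F}_m$ increases. Indeed, let us fix $m, m^\star \in \mathcal{M}_n$
(think of $m^\star$ as a minimizer of the infimum in the right-hand side
of (23)). When models are nested, either $\mathcal{F}_{m^\star} \subset \mathcal{F}_m$ so that
$v(m) \leq \sup_{\mathcal{F}_{m'} \supset \mathcal{F}_{m^\star}} \{ v(m') \}$, or $\mathcal{F}_m \subset \mathcal{F}_{m^\star}$ so that $\varphi_{m^\star} \leq \varphi_m$, hence $\varphi_m^* \leq \varphi_{m^\star}^*$.
In the second case,

$$v(m) - \varepsilon P(f_m - f^*) \leq \varphi_m^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) \leq \varphi_{m^\star}^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) \leq \varphi_{m^\star}^*\left( \sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}} \right)$$

since $t_m \leq t_{m^\star}$ and $\varphi_{m^\star}^*$ is nondecreasing. As a consequence, for any
$m^\star \in \mathcal{M}_n$,

$$V_n \leq \frac{1}{1 - \varepsilon} \max \left\{ \sup_{\mathcal{F}_{m'} \supset \mathcal{F}_{m^\star}} \{ v(m') \} \; ; \; \varphi_{m^\star}^*\left( \sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}} \right) \right\}$$

which is not too large provided that $v(m)$ never increases too much. Notice
that we can understand assumption (13) as ensuring that the penalty
compensates a possible increase of $v(m)$.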

3. The oracle model prediction error does not decrease to zero faster than
$n^{-1/2}$ and $t_m \leq t$. Indeed, the straightforward upper bound $v(m) \leq \sqrt{2 t_m / n}$
shows that $V_n \leq (1 - \varepsilon)^{-1} \sqrt{2t/n}$.

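4. The margin condition does not depend on $m$ and $t_m \leq t$. Indeed, when
$\varphi_m \equiv \varphi$ (or $\inf_m \varphi_m \geq \varphi$), we have

$$V_n \leq \frac{1}{1 - \varepsilon} \sup_{m \in \mathcal{M}_n} \left\{ \varphi_m^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) \right\} \leq \frac{1}{1 - \varepsilon} \varphi^*\left( \sqrt{\frac{2 t}{\varepsilon^2 n}} \right) .$$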

5. The penalty satisfies $c\, \mathrm{pen}(m) \geq v(m)$ for every $m \in \mathcal{M}_n$, which can
be ensured for instance by adding $c^{-1} v(m)$ (or an estimate of it) to a
penalty satisfying (22). This method is for instance the one proposed by
Koltchinskii [16] (Section 5.2), and in that case (23) coincides with his
Theorem 6.

Points 3 and 4 above show that the challenging situations are the ones where
the margin condition indeed depends on the model, and fast rates of estimation
are attainable. We prove in Section [5.2] that such situations can occur,
highlighting how our Theorem [1] is an improvement on existing results and
their straightforward consequences.

On the other hand, point 5 may seem contradictory with the negative results
of Section [4]. The explanation is that using $v(m)$ in the penalty means that
$\widehat{m}$ is not only a function of the data, but also of the unknown distribution $P$.
Then, it cannot be considered adaptive. A more surprising consequence of this
remark combined with Theorem [2] is that $v(m)$ cannot be estimated accurately
enough uniformly over the set of all distributions $P$. Consider the proposal, in
Section 5.1 of Hi . to add 

°\ n 

to the penalty, which is sufficient to give a result like (14). The point is that
such a penalty is generally much too large (at least for small models), which
often results in an upper bound of order $n^{-1/2}$. In the examples we have in mind
(as well as in the counterexamples of Section [4]), the excess risk of the oracle is
much smaller, typically of order $n^{-\beta}$ for some $\beta \in (1/2; 1]$.

5.2. The local margin conditions can be significantly tighter than 
the global one 

In this section, we show that there exist challenging situations, in which the
margin condition holds for functions $\varphi_m$ strongly depending on $m$.

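Proposition 2. Let $\kappa \in (1; +\infty)$ and assume that $\mathcal{X}$ is infinite. Let $\gamma$ be the
0-1 loss and $\mathcal{F}^{0-1} := \{ \gamma(u; \cdot) \text{ s.t. } u : \mathcal{X} \to \{0,1\} \text{ is measurable} \}$ be the
associated loss function class. Then, there exist a probability distribution $P$ on
$\mathcal{X} \times \{0,1\}$, a sequence $(f_j)_{j \in \mathbb{N}}$ of elements of $\mathcal{F}^{0-1}$, and positive constants
$(C_i)_{5 \leq i \leq 7}$ (depending on $\kappa$ only) such that

(i) $\forall k \in \mathbb{N}$, $P(f_{2k+1} - f^*) = P(f_{2k} - f^*) = b(k)$ and $2^{-k\kappa - 2} \leq b(k) \leq 2^{-k\kappa - 1}$.

(ii) The global margin condition (8) is satisfied over $\mathcal{F} = \mathcal{F}^{0-1}$ with $\varphi(x) = C_5 x^{2\kappa}$,
and it is tight: $\forall k \in \mathbb{N}$, $\varphi\left( \sqrt{\mathrm{var}_P(f_{2k+1} - f^*)} \right) \geq C_6 P(f_{2k+1} - f^*)$.

(iii) A tighter local margin condition (9) holds over $\{ f_{2k} \text{ s.t. } k \in \mathbb{N} \}$: $\forall k \in \mathbb{N}$,
$P(f_{2k} - f^*) \geq \mathrm{var}_P(f_{2k} - f^*)$.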

(iv) For every $m \in \mathbb{N}$, define $\mathcal{F}_m = \{ f_m \}$ and consider the model selection
problem among $(\mathcal{F}_m)_{0 \leq m \leq M_n}$ with $M_n \geq 2 \log_2(n)$. Then, the right-hand
side of a strong margin adaptive oracle inequality of the form (10) is at
most proportional to

$$\inf_{0 \leq 2k \leq M_n} \left\{ P(f_{2k} - f^*) + \frac{\ln(n)}{n} \right\} \leq \frac{2 \ln(n)}{n}$$

whereas the right-hand side of a global margin adaptive oracle inequality
is larger than $C_7 n^{-\kappa/(2\kappa - 1)} \gg (\ln(n))/n$.

Proposition [2] is proved in Section [7.3]. It gives an example of a model
selection problem where strong margin adaptivity implies a faster rate of
convergence than adaptivity to the global margin condition. Note that the same
argument works with many other model selection problems, such as selecting
among $\left( \{ f_{2k+1} \text{ s.t. } 0 \leq k \leq m \} \right)_{m \in \{ 1, \ldots, (\ln(n))^2 \}}$.
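
To get a feeling for the gap between the two rates in point (iv), the sketch below compares $\ln(n)/n$ with $n^{-\kappa/(2\kappa-1)}$ for a few values of $n$ and $\kappa$. It ignores all constants (in particular $C_7$), so it only illustrates orders of magnitude; the function names are hypothetical.

```python
import math


def strong_margin_rate(n):
    """Order of the strong margin adaptive bound: ln(n) / n (constants ignored)."""
    return math.log(n) / n


def global_margin_rate(n, kappa):
    """Order of the global margin adaptive bound: n^(-kappa / (2 kappa - 1))."""
    return n ** (-kappa / (2.0 * kappa - 1.0))


for n in (10**4, 10**6, 10**8):
    for kappa in (1.5, 2.0, 4.0):
        ratio = global_margin_rate(n, kappa) / strong_margin_rate(n)
        print(f"n={n:>9}, kappa={kappa}: global/strong = {ratio:.2f}")
```

The comparison is asymptotic: for $\kappa$ close to 1 and moderate $n$, the ratio can be below 1, and it grows with $n$ for any fixed $\kappa > 1$.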

6. Discussion 

6.1. Other penalization procedures 

We have focused in Section [3.2] on penalties defined in terms of local Rademacher
complexities, in order to prove that strong margin adaptivity is attainable for
some data-driven penalties. An interesting question is whether such a result can
be extended to penalties that can be computed faster.

For instance, it is natural to think of estimating $\mathrm{pen}_{\mathrm{id}}(m)$ itself by resampling,
instead of the local complexity $\delta_n(\mathcal{F}_m; t)$. Such penalties, with several kinds
of resampling schemes, have been proposed in [2, 3] and called "Resampling
Penalties" (RP), generalizing the bootstrap penalty suggested by Efron [13].
RP can be computed faster than local Rademacher complexities, because they
are not defined as a fixed point of the resampling estimate of a function. In
particular, the $V$-fold penalties defined in [2] have the same computational cost
as $V$-fold cross-validation.

In addition, RP are easy to calibrate, since they depend on a single tun- 
ing parameter — the multiplicative factor in front of it — which can for instance 
be estimated from the data by using the "slope heuristics" (see [4]). On the 
contrary, local Rademacher complexities depend on two more constants, whose 
theoretical values are certainly too large for practical application. 

Extending Corollary [1] to RP would require proving that RP satisfy both
assumptions (12) and (13). On the one hand, (12) means essentially that the
penalty is larger than the expectation of the ideal penalty with large probability.
Hence, one can conjecture that (12) holds for RP; a partial proof of (12) for RP
in our general setting can be found in Chapter 7 of p] , together with an agenda 
for a complete proof, which seems to be a difficult theoretical problem. On the
other hand, (13) seems less likely to hold for RP, and we may have to modify
RP so that (13) can be satisfied in general.

Proving such results would be quite interesting, since it would provide a
strong margin adaptive penalization procedure with a reasonably small compu- 
tational cost. 

6.2. Should we make collections of models nested? 

A natural question coming from our results is whether one should make any 
collection of models nested before performing model selection, in order to im- 
prove performance. Let us consider the counterexample of Theorem [2] and look 
at what would happen if we make the models nested. 

Assume that $P = P_1$ is the distribution defined in the proof of Theorem [2]. On
the one hand, comparing $\{ f_0 \}$ and $\{ f_0, f_1 \}$, the model selection problem would
be easy because the margin parameter $h_m$ is the same in both models, making
the remainder term of order $n^{-1/2}$ (the remainder term $(n h_m)^{-1}$ can be replaced
by $n^{-1/2}$ when $h_m \leq n^{-1/2}$ because of the upper bound $\mathrm{var}_P(f_m - f^*) \leq 1/4$).
And margin adaptivity is not challenging when the margin condition is
simply not satisfied. On the other hand, when $P = P_1$, comparing $\{ f_1 \}$ and
$\{ f_0, f_1 \}$ is more challenging because $f_1$ is really better than $f_0$. Here, contrary
to the non-nested case, the large increase of the term $\mathrm{var}_P(f_m - f^*)$ induces a
similar increase in the $L^2(P_1)$ diameter of the class. Hence, local Rademacher
complexities can detect it, as shown by Theorem [1].

To conclude, improving significantly the prediction performance of the final
estimator by making the models nested requires some prior knowledge, such
as a natural ordering between the (non-nested) models. Otherwise, Theorem [2]
shows that choosing from data (or randomly) how to make the models nested
is not successful with probability at least $C_3 > 0$, whatever the sample size.

7. Proofs 

7.1. Oracle inequalities 

We give the proofs in a logical order, that is, first Theorem [3], then Theorem [1]
(which is a corollary of it), and finally Corollary [1].

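Proof of Theorem [3]. First, by definition of $\widehat{m}$, for every $m \in \mathcal{M}_n$ we have

$$P_n(\widehat{f}_{\widehat{m}}) + \mathrm{pen}(\widehat{m}) \leq P_n(f_m) + \mathrm{pen}(m) ,$$

which can be rewritten as

$$P(\widehat{f}_{\widehat{m}} - f^*) + (P_n - P)(\widehat{f}_{\widehat{m}} - f_{\widehat{m}}) + (P_n - P)(f_{\widehat{m}} - f^*) + \mathrm{pen}(\widehat{m})
\leq P(f_m - f^*) + (P_n - P)(f_m - f^*) + \mathrm{pen}(m) .$$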
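
On the event where (22) holds, we then have

$$P(\widehat{f}_{\widehat{m}} - f^*) + (P_n - P)(f_{\widehat{m}} - f^*) + c\, \mathrm{pen}(\widehat{m}) + \frac{t_{\widehat{m}}}{3n}
\leq \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + (P_n - P)(f_m - f^*) + \mathrm{pen}(m) \right\} . \tag{24}$$

By Bernstein's inequality (see for instance Proposition 2.9 in [22]), for every
$m \in \mathcal{M}_n$, there is an event of probability $1 - 2 e^{-t_m}$ on which

$$\left| (P_n - P)(f_m - f^*) \right| \leq v(m) + \frac{t_m}{3n} .$$

On the intersection of these events with the one on which (22) holds, we derive
from (24) that

$$P(\widehat{f}_{\widehat{m}} - f^*) - v(\widehat{m}) + c\, \mathrm{pen}(\widehat{m}) \leq \inf_{m \in \mathcal{M}_n} \left\{ P(f_m - f^*) + \mathrm{pen}(m) + v(m) + \frac{t_m}{3n} \right\} .$$

For any $\varepsilon > 0$, the left-hand side is larger than

$$(1 - \varepsilon) P(\widehat{f}_{\widehat{m}} - f^*) + \varepsilon P(f_{\widehat{m}} - f^*) + c\, \mathrm{pen}(\widehat{m}) - v(\widehat{m})
\geq (1 - \varepsilon) P(\widehat{f}_{\widehat{m}} - f^*) - \sup_{m \in \mathcal{M}_n} \left\{ v(m) - \varepsilon P(f_m - f^*) - c\, \mathrm{pen}(m) \right\} .$$

The result follows. □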
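
The Bernstein step of this proof can be illustrated on the simplest bounded case. The sketch below is an illustration only, under the assumption that the centered quantity is an average of i.i.d. Bernoulli($p$) variables (a stand-in for $f_m - f^*$); it computes exactly the probability that the empirical mean deviates from $p$ by more than $\sqrt{2 t\, \mathrm{var}/n} + t/(3n)$ and compares it with $2e^{-t}$. Function names and parameter values are hypothetical.

```python
import math


def bernstein_threshold(t, var, n):
    """Deviation level sqrt(2 t var / n) + t / (3 n) for variables bounded by 1."""
    return math.sqrt(2.0 * t * var / n) + t / (3.0 * n)


def exact_deviation_probability(n, p, threshold):
    """P(|S/n - p| > threshold) for S ~ Binomial(n, p), by direct summation."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > threshold:
            total += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
    return total


n, t = 200, 3.0
for p in (0.1, 0.5):
    thr = bernstein_threshold(t, p * (1.0 - p), n)
    prob = exact_deviation_probability(n, p, thr)
    print(f"p={p}: P(deviation > {thr:.4f}) = {prob:.5f}, 2*exp(-t) = {2 * math.exp(-t):.5f}")
```

The exact probability is typically well below $2e^{-t}$ here, but the text above recalls that the $v(m)$ order of the deviations cannot be improved in general.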

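Proof of Theorem [1]. We consider the event on which (23) holds. By Theorem [3],
we know that it has probability at least $1 - \eta - 2 \sum_{m \in \mathcal{M}_n} e^{-t_m}$. We start with
the first term in the right-hand side of (23). From (9), we have

$$\forall m \in \mathcal{M}_n , \quad v(m) \leq \sqrt{\frac{2 t_m}{n}} \, \varphi_m^{-1}\left( P(f_m - f^*) \right) .$$

Then, using that $xy \leq \varphi_m^*(x) + \varphi_m(y)$ for every $x, y \geq 0$,

$$\forall m \in \mathcal{M}_n , \quad v(m) \leq \varphi_m^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) + \varphi_m\left( \varepsilon \varphi_m^{-1}\left( P(f_m - f^*) \right) \right) .$$

Since $\varphi_m$ is convex with $\varphi_m(0) = 0$, we have $\varphi_m(\lambda x) \leq \lambda \varphi_m(x)$ for every
$\lambda \in (0, 1)$ and $x \geq 0$. Then, using also that $\mathrm{var}_P(f_m - f^*) \leq 1$,

$$\forall m \in \mathcal{M}_n , \quad v(m) \leq \min \left\{ \sqrt{\frac{2 t_m}{n}} , \varphi_m^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) \right\} + \varepsilon P(f_m - f^*) , \tag{25}$$

and the right-hand side of (23) is smaller than

$$\frac{1}{1 - \varepsilon} \inf_{m \in \mathcal{M}_n} \left\{ (1 + \varepsilon) P(f_m - f^*) + \mathrm{pen}(m) + \min \left\{ \sqrt{\frac{2 t_m}{n}} , \varphi_m^*\left( \sqrt{\frac{2 t_m}{\varepsilon^2 n}} \right) \right\} + \frac{t_m}{3n} \right\} + V_n . \tag{26}$$

It now remains to upper bound $V_n$.

Let $m, m' \in \mathcal{M}_n$. Since the models $\mathcal{F}_m$ are nested, two cases can occur: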



imsart-generic ver. 2007/09/18 file: margin.tex date: February 8, 2010 



Arlot, S. and Bartlett, P. /Margin- adaptive model selection 






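1. $\mathcal{F}_m \subset \mathcal{F}_{m'}$, which implies $t_m \leq t_{m'}$ and $\varphi_m \geq \varphi_{m'}$, hence $\varphi_m^* \leq \varphi_{m'}^*$.
Using in addition (25) and that $\varphi_{m'}^*$ is nondecreasing, we have

$$v(m) - \varepsilon P(f_m - f^*) \leq \min \left\{ \sqrt{\frac{2 t_{m'}}{n}} , \varphi_{m'}^*\left( \sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}} \right) \right\} .$$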



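2. $\mathcal{F}_{m'} \subset \mathcal{F}_m$. Using (13) and (25),

$$v(m) \leq C_1 v(m') + C_2 P(f_{m'} - f^*) + c\, \mathrm{pen}(m)
\leq C_1 \min \left\{ \sqrt{\frac{2 t_{m'}}{n}} , \varphi_{m'}^*\left( \sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}} \right) \right\} + (C_2 + C_1 \varepsilon) P(f_{m'} - f^*) + c\, \mathrm{pen}(m) .$$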



Therefore, 



and the result follows. □ 

Proof of Corollary [1]. From [16] (Theorem 1 and (9.2) in the proof of its Lemma 2),
we know that there exist numerical constants $K > 0$ and $q > 1$ such that (12)
holds with t m = t, c = 5/7 and 77 = (i + 1) ln g (f ) Card(7W„)e~*. 

In addition, Lemma [3] below shows that (13) holds with $C_1 = \sqrt{2}$ and $C_2 = 2/(Kq)$.

The result follows from Theorem [1] with $t_m = t$. □



Lemma 3. Let $\mathcal{F}_{m'} \subset \mathcal{F}_m$ and $\delta_n$ be defined by (15). Then,

$$v(m) \leq 2 \delta_n(\mathcal{F}_m ; t) + \sqrt{2}\, v(m') + \frac{2 P(f_{m'} - f^*)}{qK} . \tag{27}$$

Proof of Lemma [3]. Since $\mathcal{F}_{m'} \subset \mathcal{F}_m$, $f_{m'} \in \mathcal{F}_m$ (as well as $f_m$), so that

$$D_P\left( \mathcal{F}_m ; P(f_{m'} - f_m) \right) \geq \sqrt{P(f_m - f_{m'})^2} \geq \sqrt{\mathrm{var}_P(f_m - f_{m'})}
\geq \sqrt{\frac{\mathrm{var}_P(f_m - f^*)}{2}} - \sqrt{\mathrm{var}_P(f_{m'} - f^*)} . \tag{28}$$

For the last inequality, we used that $\mathrm{var}(X) \leq 2\, \mathrm{var}(X + Y) + 2\, \mathrm{var}(Y)$ for any
random variables $X, Y$, and the inequality $\sqrt{x + y} \leq \sqrt{x} + \sqrt{y}$ for every $x, y \geq 0$.
First, assume that the lower bound in (28) is nonpositive. This implies

$$v(m) = \sqrt{\frac{2t}{n} \mathrm{var}_P(f_m - f^*)} \leq \sqrt{2}\, v(m')$$

so that (27) holds.

Otherwise, the assumptions of Lemma [4] below hold with

$$D_0 = \sqrt{\frac{\mathrm{var}_P(f_m - f^*)}{2}} - \sqrt{\mathrm{var}_P(f_{m'} - f^*)} > 0 \quad \text{and} \quad \sigma_0 = P(f_{m'} - f_m) .$$

We deduce from (29) that

$$\frac{v(m)}{2} \leq \delta_n(\mathcal{F}_m ; t) + \frac{v(m')}{\sqrt{2}} + \frac{P(f_{m'} - f_m)}{qK} \leq \delta_n(\mathcal{F}_m ; t) + \frac{v(m')}{\sqrt{2}} + \frac{P(f_{m'} - f^*)}{qK}$$

and (27) holds also. □



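Lemma 4. Let $\delta_n(\mathcal{F}_m ; t)$ be defined by (15). Assume that there is some $D_0, \sigma_0 > 0$
such that $D_P(\mathcal{F}_m ; \sigma_0) \geq D_0$. Then, we have the following lower bound:

$$\max \left\{ \delta_n(\mathcal{F}_m ; t) ; \frac{\sigma_0}{qK} \right\} \geq D_0 \sqrt{\frac{t}{n}} . \tag{29}$$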

proof of Lemma^ First, (p29")l clearly holds when > Doy/t/n. Otherwise, let 
(Ti = max { gif , 1 } Do y/t/n > cto . From the definition of U n , we have 

U n {T m ; oi; t) > KD P {F m -ai) IT > KD a [I=l > J_ 



Then, according to the definition (TT5)) of <5„(J r m ;t), 8 n {Pm',t) > fi > Doy/t/n 
and the result follows. □ 



7.2. Lower bounds (proof of Theorem [2])

For every $m \in \{0, 1\}$, let $f_m : (x, y) \mapsto 1_{y \neq m}$; $f_m \in \mathcal{F}^{0-1}$ since $f_m = \gamma(u_m ; \cdot)$
where for every $x \in \mathcal{X}$, $u_m(x) = m$. Let $\alpha = (2n)^{-1}$ and $h = (2n)^{-1/2}$. Let
$a \neq b$ be any two elements of $\mathcal{X}$. We define a probability distribution $P_1$ on
$\mathcal{X} \times \{0,1\}$ as follows: if $(X, Y) \sim P_1$, then $\mathbb{P}(X = a) = \alpha$, $\mathbb{P}(X = b) = 1 - \alpha$,
$\mathbb{P}(Y = 1 \mid X = a) = 0$ and $\mathbb{P}(Y = 1 \mid X = b) = \frac{1}{2} + h$. We also define $P_0$
as the distribution of $(X, 1 - Y)$ where $(X, Y) \sim P_1$. In the following, for
any distribution $Q$ on $\mathcal{X} \times \{0, 1\}$, we use the notation $\mathbb{P}_Q$ as a shortcut for
$\mathbb{P}_{(X_i, Y_i)_{1 \leq i \leq n} \sim Q^{\otimes n}}$.

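First, under distribution $P_1$, the Bayes predictor is $s = 1_{\{b\}}$,
$P_1(f_0 - f^*) = 2(1 - \alpha) h$, $P_1(f_1 - f^*) = \alpha$ and $\mathrm{var}_{P_1}(f_1 - f^*) = \alpha - \alpha^2$.
Hence,

$$\min_{m \in \{0, 1\}} \left\{ P_1(f_m - f^*) + v(m) + \frac{\ln(n)}{n h_m} \right\}
\leq P_1(f_1 - f^*) + v(1) + \frac{\ln(n)}{n h_1}
\leq \alpha + \sqrt{\frac{2 \alpha \ln(n)}{n}} + \frac{\ln(n)}{n}
\leq \frac{2 + 3 \ln(n)}{2n} .$$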
n/ii V rt n 2n 

Therefore, if $\mathbb{P}_{P_1}(\widehat{m} = 0) \geq C_3$, then (20) holds when $P = P_1$, with $C_4 = 1/3$.
Similarly, $\mathbb{P}_{P_0}(\widehat{m} = 1) \geq C_3$ implies (20) with $P = P_0$ and $C_4 = 1/3$. So, in
order to prove (20), we only need to prove that

$$\max_{j \in \{0, 1\}} \left\{ \mathbb{P}_{P_j}(\widehat{m} = 1 - j) \right\} \geq C_3 > 0 . \tag{30}$$
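
The quantities of this counterexample are easy to tabulate. The following sketch assumes the definitions $\alpha = (2n)^{-1}$, $h = (2n)^{-1/2}$, $t = \ln(n)$, $h_1 = 1$ and $v(1) = \sqrt{2 \ln(n)\, \mathrm{var}_{P_1}(f_1 - f^*)/n}$; it checks that the oracle term stays below $(2 + 3\ln(n))/(2n)$ while the excess loss of the wrong model $f_0$ is larger by a factor of order $\sqrt{n}/\ln(n)$. The function name and the returned dictionary keys are hypothetical.

```python
import math


def counterexample_quantities(n):
    """Excess losses under P_1 and the oracle bound, for sample size n >= 2."""
    alpha = 1.0 / (2.0 * n)
    h = (2.0 * n) ** -0.5
    t = math.log(n)
    excess_f0 = 2.0 * (1.0 - alpha) * h          # P_1(f_0 - f*)
    excess_f1 = alpha                            # P_1(f_1 - f*)
    var_f1 = alpha - alpha**2                    # var_{P_1}(f_1 - f*)
    oracle = excess_f1 + math.sqrt(2.0 * t * var_f1 / n) + t / n  # assumes h_1 = 1
    bound = (2.0 + 3.0 * t) / (2.0 * n)
    return {"excess_f0": excess_f0, "oracle": oracle, "bound": bound}


for n in (10, 1000, 10**6):
    q = counterexample_quantities(n)
    print(f"n={n}: oracle={q['oracle']:.3e} <= bound={q['bound']:.3e}, "
          f"excess_f0/bound={q['excess_f0'] / q['bound']:.2f}")
```

The growing ratio in the last column is what prevents any oracle inequality with a bounded leading constant in this example.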



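The proof of (30) relies on three main facts. First,

$$\forall j \in \{0, 1\} , \quad \mathbb{P}_{P_j}(\forall i , X_i = b) = (1 - \alpha)^n = \left( 1 - \frac{1}{2n} \right)^n \geq \frac{1}{2} . \tag{31}$$

Second, for every $j \in \{0, 1\}$, under $P_j$, conditionally to $\{ \forall i , X_i = b \}$,
$\mathrm{Card}\{ i \text{ s.t. } Y_i = 1 \}$ is a binomial random variable with parameters $(n, p_j)$,
where

$$p_j = \mathbb{P}_{(X, Y) \sim P_j}(Y = 1 \mid X = b) = \frac{1}{2} + (-1)^{j+1} h .$$

So, Lemma [5] shows that for every $j \in \{0, 1\}$ and every $k \in \mathbb{N} \cap \left[ \frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n} \right]$,

$$\mathbb{P}_{P_j}\left( \mathrm{Card}\{ i \text{ s.t. } Y_i = 1 \} = k \mid \forall i , X_i = b \right) \geq \frac{C}{\sqrt{n}} > 0 , \tag{32}$$

where $C$ is an absolute constant.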

Third, let us define, for every $k \in \{0, \ldots, n\}$,

$$\pi_k := \mathbb{P}_{P_U}\left( \widehat{m}\left( (X_i, Y_i)_{1 \leq i \leq n} \right) = 1 \mid \mathrm{Card}\{ i \text{ s.t. } Y_i = 1 \} = k \text{ and } \forall i , X_i = b \right) ,$$

where $P_U$ is the uniform distribution on $\{ a, b \} \times \{ 0, 1 \}$. A crucial remark is that
$P_U$ can be replaced by either $P_0$ or $P_1$ in the definition of $\pi_k$, since the
conditioning event determines $(X_i, Y_i)_{1 \leq i \leq n}$ up to the ordering of the observations; in
the definition of $\pi_k$, the probability only refers to the ordering of the $(X_i, Y_i)$,
and any product measure on $\mathcal{X} \times \{0, 1\}$ assigns equal probabilities to the $n!$
permutations of the $n$ observations. Note also that the definition of $\pi_k$ stays
valid when $\widehat{m}$ is a randomized selection rule, which proves the generalization of
Theorem [2] pointed out in Remark [3]. For any given selection rule $\widehat{m}$,

$$\mathrm{Card}\left\{ k \in \mathbb{N} \cap \left[ \frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n} \right] \text{ s.t. } \pi_k \geq \frac{1}{2} \right\}$$

is either larger or smaller than $\sqrt{n}$. If it is larger, (31), (32) and the definition
of the $\pi_k$ (with $P_0$ instead of $P_U$) show that

$$\mathbb{P}_{P_0}\left( \widehat{m}\left( (X_i, Y_i)_{1 \leq i \leq n} \right) = 1 \right) \geq \sqrt{n} \times \frac{C}{\sqrt{n}} \times \frac{1}{4} = \frac{C}{4} = C_3 > 0 ,$$

so that (30) is satisfied. Otherwise, choosing $P_1$ instead of $P_0$ shows that (30)
holds true. This proves (20), which clearly implies (21), since $P(\widehat{f}_{\widehat{m}} - f^*) \geq 0$
a.s. □

A key tool in the proof of Theorem [2] is the following uniform lower bound
on the density of the binomial distribution w.r.t. the counting measure on $\mathbb{N}$.

Lemma 5. For every $n \in \mathbb{N}$ and $p \in [0, 1]$, let $\mathcal{B}(n, p)$ denote the binomial
distribution with parameters $(n, p)$. For every $a, b > 0$ and $c \in (0, 1/2)$, a positive
constant $C(a, b, c)$ exists such that for any positive integer $n$,

$$\inf_{\substack{k \in \mathbb{N} , \; |k - n/2| \leq \min\{ a n^{1/2} , \, n/4 \} \\ |p - 1/2| \leq \min\{ b n^{-1/2} , \, c \}}} \left\{ \sqrt{n} \, \mathbb{P}_{Z \sim \mathcal{B}(n, p)}(Z = k) \right\} \geq C(a, b, c) > 0 . \tag{33}$$

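Proof of Lemma [5]. Let $n, k, p$ satisfy the above conditions, $Z \sim \mathcal{B}(n, p)$, and
define

$$\eta := \frac{2k}{n} - 1 \qquad \delta := p - \frac{1}{2} .$$

The assumption on $k$ and $p$ becomes $|\eta| \leq \min\{ 2a n^{-1/2}, 1/2 \}$ and $|\delta| \leq \min\{ b n^{-1/2}, c \}$.
In addition,

$$\mathbb{P}(Z = k) = \binom{n}{k} p^k (1 - p)^{n - k} = \left( \frac{1}{2} + \delta \right)^k \left( \frac{1}{2} - \delta \right)^{n - k} \frac{n!}{k! (n - k)!} .$$

We now use Stirling's formula:

$$\ln(n!) = n \ln(n) - n + \frac{1}{2} \ln(2 \pi n) + \varepsilon_n$$

for some sequence $\varepsilon_n \to 0$ when $n \to +\infty$ (one has $(12n + 1)^{-1} \leq \varepsilon_n \leq (12n)^{-1}$).
Then,

$$\ln \mathbb{P}(Z = k) = k \ln\left( \frac{1}{2} + \delta \right) + (n - k) \ln\left( \frac{1}{2} - \delta \right) + \ln \frac{n!}{k! (n - k)!}$$

$$= \frac{n}{2} \left[ (1 - \eta) \ln\left( \frac{1 - 2\delta}{1 - \eta} \right) + (1 + \eta) \ln\left( \frac{1 + 2\delta}{1 + \eta} \right) \right]
- \frac{1}{2} \ln(n) + \frac{1}{2} \ln\left( \frac{2}{\pi} \right) - \frac{1}{2} \ln\left( 1 - \eta^2 \right) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k} .$$

Define $h : (-1, +\infty) \to \mathbb{R}$ by $h(x) := \ln(1 + x)/x - 1$, so that

$$\forall x > -1 , \quad \ln(1 + x) = x \left( 1 + h(x) \right) .$$

Recall that $|h(x)| \leq 2 |x|$ as soon as $x \geq -1/2$, by the Taylor-Lagrange formula.
In particular, $\lim_{x \to 0} h(x) = 0$. We then have

$$\ln \mathbb{P}(Z = k) = \frac{n}{2} \left[ 4 \delta \eta - 2 \eta^2 - 2 \delta (1 - \eta) h(-2\delta) + \eta (1 - \eta) h(-\eta) + 2 \delta (1 + \eta) h(2 \delta) - \eta (1 + \eta) h(\eta) \right]$$

$$- \frac{1}{2} \ln(n) + \frac{1}{2} \ln\left( \frac{2}{\pi} \right) + \frac{\eta^2}{2} \left( 1 + h(-\eta^2) \right) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k} .$$

Assuming that $n \geq n_0$, where $n_0$ is such that $\max\{ a, b \} \, n_0^{-1/2} \leq 1/2$, it follows that

$$\ln \mathbb{P}(Z = k) = - \frac{1}{2} \ln(n) + R(k, n, p)$$

with

$$R(k, n, p) \geq - L \left( 1 + a^2 + ab + b^2 \right)$$

for some numerical constant $L > 0$, and this lower bound is uniform over $n \geq n_0$
and $k, p$ such that the conditions of the infimum in (33) are satisfied. On the
other hand,

$$\inf_{n < n_0 , \; 1 \leq k \leq n} \left\{ \mathbb{P}_{Z \sim \mathcal{B}(n, p)}(Z = k) \right\} \geq K(p) > 0$$

as soon as $p \in (0, 1)$. Since $\mathbb{P}_{Z \sim \mathcal{B}(n, p)}(Z = k)$, seen as a function of $p$, is
increasing on $(0, k/n)$ and decreasing on $(k/n, 1)$, $K(p)$ is uniformly larger than
$\min\{ K(1/2 - c), K(1/2 + c) \}$. The result follows. □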
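
The uniform lower bound of Lemma 5 can be observed numerically in the setting where it is used for (32). The sketch below is an illustration, assuming $p = 1/2 \pm (2n)^{-1/2}$ and $k$ ranging over the integers of $[n/2 - \sqrt{n}, n/2 + \sqrt{n}]$; it evaluates $\sqrt{n}\, \mathbb{P}(Z = k)$ with log-gamma functions to avoid overflow. The function names are hypothetical, and the threshold used in the final check is an empirical choice, not a constant from the paper.

```python
import math


def log_binom_pmf(n, k, p):
    """Logarithm of P(Z = k) for Z ~ Binomial(n, p), with 0 < p < 1."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)


def scaled_min_pmf(n):
    """Minimum of sqrt(n) * P(Z = k) over the k and p used for inequality (32)."""
    h = (2.0 * n) ** -0.5
    k_low = math.ceil(n / 2 - math.sqrt(n))
    k_high = math.floor(n / 2 + math.sqrt(n))
    worst = min(
        math.exp(log_binom_pmf(n, k, p))
        for p in (0.5 - h, 0.5 + h)
        for k in range(k_low, k_high + 1)
    )
    return math.sqrt(n) * worst


for n in (10, 100, 1000, 10000):
    print(f"n={n}: min sqrt(n) * P(Z = k) = {scaled_min_pmf(n):.5f}")
```

The printed values stay bounded away from zero as $n$ grows, which is the content of (33) for this particular choice of $a$ and $b$.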



7.3. Proof of Proposition [2]

Let $(x_j)_{j \in \mathbb{N}}$ be any infinite sequence of distinct elements of $\mathcal{X}$ and $\lambda > 0$ to
be chosen later. We define $P$ as follows, by denoting $(X, Y)$ a pair of random
variables with joint distribution $P$. For every $k \in \mathbb{N}$, $\mathbb{P}(X = x_{2k}) = p_k q_k$ and
$\mathbb{P}(X = x_{2k+1}) = p_k (1 - q_k)$, where $p_k = 2^{-k-1}$ and $q_k \in [0, 1]$ is to be chosen
later; note that $\sum_{k \in \mathbb{N}} p_k = 1$. For every $k \in \mathbb{N}$, $\mathbb{P}(Y = 1 \mid X = x_{2k}) = 0$ and
$\mathbb{P}(Y = 1 \mid X = x_{2k+1}) = (1 + \delta_k)/2$ where $\delta_k = 2^{-k\lambda}$. As a consequence, the
Bayes predictor is $s := 1_{\{ x_{2k+1} \text{ s.t. } k \in \mathbb{N} \}}$. Let us define for every $j \in \mathbb{N}$,

$$u_j(x) := \begin{cases} s(x) & \text{if } x \neq x_j \\ 1 - s(x) & \text{if } x = x_j \end{cases} \qquad \text{and} \qquad f_j = \gamma(u_j ; \cdot)$$

where $\gamma$ is the 0-1 loss. Then, for any $k \in \mathbb{N}$,

$$P(f_{2k+1} - f^*) = p_k \delta_k (1 - q_k) \qquad P(f_{2k} - f^*) = p_k q_k \tag{34}$$

$$\mathrm{var}_P(f_{2k+1} - f^*) = p_k (1 - q_k) - \left( \delta_k p_k (1 - q_k) \right)^2 \tag{35}$$

$$\mathrm{var}_P(f_{2k} - f^*) = p_k q_k - (p_k q_k)^2 . \tag{36}$$

We can now prove the four statements of Proposition [2].

(i) Choosing $q_k = \delta_k / (1 + \delta_k)$ and $\lambda = \kappa - 1 > 0$, (34) implies (i) with
$b(k) = p_k q_k$.

(ii) For every $t \in (0, 1)$,

$$\mathbb{P}\left( |2 \eta(X) - 1| \leq t \right) = \sum_{k \in \mathbb{N}} \mathbb{P}(X = x_{2k+1}) 1_{\delta_k \leq t} \leq \sum_{k \text{ s.t. } 2^{-k\lambda} \leq t} 2^{-k-1} \leq t^{1/\lambda} . \tag{37}$$

By Lemma 9 of [9], (37) implies the global margin condition over $\mathcal{F}^{0-1}$ with
function $\varphi(x) = C_5 x^{2(\lambda + 1)}$, where $C_5$ only depends on $\lambda$. This implies the
first part of (ii) since $\lambda = \kappa - 1 > 0$. For the second part, (35) implies that

$$\mathrm{var}_P(f_{2k+1} - f^*) \geq p_k (1 - q_k)(1 - p_k) \geq \frac{p_k (1 - q_k)}{2} \geq \frac{p_k}{4} = 2^{-k-3}$$

hence the second part of (ii) holds with $C_6 = C_5 2^{1 - 3\kappa}$.

(iii) By (36), $\mathrm{var}_P(f_{2k} - f^*) = p_k q_k (1 - p_k q_k) \leq p_k q_k = P(f_{2k} - f^*)$.

(iv) By (iii), for every $k \in \mathbb{N}$, a local margin condition holds on $\mathcal{F}_{2k}$ with
function $\varphi_{2k} : x \mapsto x^2$. So, the right-hand side of a strong margin adaptive
oracle inequality is at most (keeping only even values of $m$) proportional
to

$$\inf_{0 \leq k \leq M_n / 2} \left\{ P(f_{2k} - f^*) + \frac{\ln(n)}{n} \right\} \leq \frac{1}{2n} + \frac{\ln(n)}{n} \leq \frac{2 \ln(n)}{n} .$$

Note that the $\ln(n)$ factor may be replaced by a smaller quantity depending
on the framework. The last statement on global margin adaptivity
holds according to (ii), since $\varphi^*(x) = L(\kappa) x^{2\kappa/(2\kappa - 1)}$, where $L(\kappa) > 0$
only depends on $\kappa$. □
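
The construction used in this proof is explicit enough to be checked numerically. The sketch below assumes $p_k = 2^{-k-1}$, $\delta_k = 2^{-k\lambda}$, $q_k = \delta_k/(1 + \delta_k)$ and $\lambda = \kappa - 1$, together with formulas (34) to (36); it verifies statement (i), the local margin condition of (iii), and the variance lower bound $2^{-k-3}$ used for (ii). The function name and the dictionary keys are hypothetical.

```python
import math


def proposition_quantities(k, kappa):
    """Excess losses and variances of f_{2k} and f_{2k+1} in the construction."""
    lam = kappa - 1.0
    p_k = 2.0 ** (-k - 1)
    delta_k = 2.0 ** (-k * lam)
    q_k = delta_k / (1.0 + delta_k)
    excess_even = p_k * q_k                                   # P(f_{2k} - f*)
    excess_odd = p_k * delta_k * (1.0 - q_k)                  # P(f_{2k+1} - f*)
    var_even = p_k * q_k - (p_k * q_k) ** 2                   # (36)
    var_odd = p_k * (1.0 - q_k) - (delta_k * p_k * (1.0 - q_k)) ** 2  # (35)
    return {"excess_even": excess_even, "excess_odd": excess_odd,
            "var_even": var_even, "var_odd": var_odd}


kappa = 2.0
for k in range(5):
    q = proposition_quantities(k, kappa)
    print(f"k={k}: b(k)={q['excess_even']:.3e}, "
          f"var(f_2k - f*)={q['var_even']:.3e}, var(f_2k+1 - f*)={q['var_odd']:.3e}")
```

For the odd-indexed functions the variance stays of order $2^{-k}$ while the excess loss decays like $2^{-k\kappa}$, which is why only the even-indexed models satisfy the tighter local margin condition.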